Voice registration method and system, and voice recognition method and system based on voice registration method and system

ABSTRACT

Disclosed is a voice registration method for voice recognition, comprising the steps of analyzing a spectrum of a sound signal inputted from the outside; extracting predetermined language units for a speaker recognition from a voice signal in the sound signal; measuring the loudness of each language unit; collecting voice data on registered (background) speakers including loudness data of the plurality of background speakers as a reference onto voice database; determining whether the loudness of each language unit is within a predetermined loudness range based on the voice data base; learning each language unit by using a multi-layer perceptron in the case that at least a predetermined number of language units are within the predetermined loudness range; and storing data on the learned language unit as data for recognizing the speaker. With this configuration, loudness of a speaker is considered at learning for registering his/her voice and at verifying a speaker.

FIELD OF THE INVENTION

The present invention relates in general to a voice recognition method and system based on a voice registration method and system, which prevent errors due to the loudness of a speaker's voice by performing voice learning and voice recognition in consideration of the loudness of the speaker's voice.

BACKGROUND ART

Generally, security systems have mostly been used for national security and industrial security, but they are now also used for personal security and computer security.

In particular, the development of computer network systems, including the Internet, has made such systems increasingly vulnerable to attack, so that personal information is likely to leak out through networked activities such as electronic commerce.

To prevent this problem, several methods have been developed for allowing only a specified person to access a computer system. The methods may be classified into those using an ID, a password, a certification key, etc. and those using a biological property. Biological properties include a voice, a fingerprint, lines of a finger or a palm, a retinal pattern, etc.

The voice is a universal and simple means of expressing human intention. Technologies using the voice that have been proposed include a voice recognition system for perceiving the voice, a speaker recognition system for recognizing the speaker uttering the voice, etc.

In a speaker recognition system, a user does not need an ID and a password to prevent illegal use. Further, a sound card and a microphone, which are generally provided in a personal computer system, are sufficient to implement the speaker recognition system. Furthermore, with the speaker recognition system, the personal computer system can be controlled to operate in response to the voice of a specified person.

Speaker recognition may be classified into speaker identification and speaker verification in terms of the recognition method. Speaker identification identifies the speaker of an inputted voice, and speaker verification accepts or rejects a speaker by verifying the voice of the speaker.

A general process of speaker recognition is described as follows.

First, if a speaker inputs his/her voice to a speaker recognition system in order to register himself/herself, a waveform of the inputted voice signal is represented as a spectrum. The spectrum is analyzed so as to pick out an isolated word, and phonemes are sampled from the word. Herein, the phonemes are predetermined so as to be employed as a reference for recognizing the voice. Thereafter, the speaker recognition system makes a pattern for each phoneme of the speaker and compares it with patterns of the predetermined phonemes, thereby learning the speaker's characteristics. When the learning is completed, the speaker's pattern is registered.

Later, if a voice is newly inputted to the speaker recognition system, the system makes a pattern based on the newly inputted voice through the above analyzing process and compares it with the voice pattern of the registered (background) speaker, thereby accepting or rejecting the speaker.

In the conventional speaker recognition system, a newly made pattern is compared with the voice pattern of the registered speaker stored in a database. However, the voice stored in the database is recorded under ideal conditions such as little noise, a highly efficient microphone, uniform loudness of voice, etc., and therefore the voice stored in the database represents only a special case of the actual voice.

When a voice is inputted under conditions different from those of the voice stored in the database, the performance of the voice recognition system is affected. In particular, the loudness of the voice has a significant influence on the performance of the system.

Thus, in the voice recognition system, it is necessary to provide voice learning and speaker verification in consideration of the influence of the loudness of voice.

DISCLOSURE OF INVENTION

Accordingly, the present invention has been made keeping in mind the above-described shortcoming and users' needs, and an object of the present invention is to provide a voice registration method and system, and a voice recognition method and system based on the voice registration method and system, which accurately verify a speaker by performing voice learning and speaker verification in consideration of the loudness of voice.

This and other objects of the present invention may be accomplished by the provision of a voice registration method for voice recognition, comprising the steps of analyzing a spectrum of a sound signal inputted from the outside; extracting predetermined language units for speaker recognition from a voice signal in the sound signal; measuring the loudness of each language unit; collecting voice data on registered (background) speakers, including loudness data of the plurality of background speakers, as a reference in a voice database; determining whether the loudness of each language unit is within a predetermined loudness range based on the voice database; learning each language unit by using a multi-layer perceptron in the case that at least a predetermined number of language units are within the predetermined loudness range; and storing data on the learned language unit as data for recognizing the speaker.

Preferably, the voice analyzing step includes the steps of representing the voice signal of the speaker as a spectrum; and compressing the spectrum by uniformly allocating filter banks to a speaker recognition region in which the voice characteristics of the speaker are to be recognized.

Preferably, the speaker recognition region is 0˜3 KHz, in which the filter banks are uniformly allocated, whereas over 3 KHz the intervals of the filter banks increase logarithmically.

Preferably, the method further comprises the step of employing, as the language units, a plurality of phonemes selected from nasals, vowels, and approximants, which include a relatively large amount of continuous sound, wherein the language unit extracting step includes the steps of making a plurality of frames by dividing the spectrum into several parts, and extracting a frame having the language unit from among the frames.

Preferably, the loudness measuring step comprises calculating an energy value of the frame of the spectrum having the language unit.

Preferably, the method further comprises the step of extracting maximum and minimum loudness by analyzing the voice spectrum of the background speakers stored in the voice database and by calculating the energy value of the frame having the language unit, wherein the loudness determining step comprises determining whether the number of frames having a loudness between the minimum and maximum loudness accounts for a predetermined rate or more.

Preferably, the method further comprises the steps of forming a plurality of reference patterns for every language unit of the plurality of background speakers, and forming a plurality of speaker patterns for every language unit of the plurality of speakers, wherein the learning step includes the step of learning the pattern characteristics of the speaker by comparing the reference patterns with the speaker patterns according to a back-propagation algorithm.

Preferably, the method further comprises the step of making as many learning groups as the number of language units of the background speakers by employing the plurality of reference patterns for every language unit of one background speaker as a learning group, wherein the learning step comprises learning the pattern characteristics of the speaker by comparing the reference patterns of every learning group with the plurality of speaker patterns.

Preferably, the storing step comprises storing the plurality of speaker patterns of every language unit and the loudness of every language unit as speaker recognition data.

Preferably, the method further comprises the step of requesting the speaker to re-utter in the case that at least the predetermined number of language units are not within the predetermined loudness range.

According to another embodiment of the present invention, the above and other objects may also be achieved by the provision of a speaker recognition method for recognizing whether a speaker is a registered speaker, comprising the steps of analyzing a spectrum of a sound signal inputted from the outside; extracting predetermined language units for speaker recognition from a voice signal in the sound signal; measuring the loudness of each language unit; determining whether the loudness of each language unit is within a predetermined loudness range; calculating a speaker score by calculating, through a multi-layer perceptron, the probability that each language unit belongs to the speaker, and by averaging the probabilities, in the case that at least a predetermined number of language units are within the predetermined loudness range; and verifying that the speaker is registered when the speaker score is beyond a threshold value by comparing the calculated speaker score with the predetermined threshold value, which is a predetermined minimum speaker score for verifying the registered speaker.

Preferably, the speaker score can be calculated from the following equation:

$\mathrm{Score}_{speaker} = \frac{1}{M}\sum_{i=0}^{M-1} P(LU_{i})$

where P(LU_i) is a score of the probability that the enquiring speaker is the background speaker of the i-th language unit frame, and M is the number of language unit frames extracted from an isolated word.

Further, the speaker score can be calculated on the basis of weights given to the language units according to verifiability.
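The averaging described above can be illustrated with a minimal Python sketch. The function names `speaker_score` and `verify` are hypothetical, the normalization of the weighted variant by the sum of the weights is an assumption (the text only states that weights are given according to verifiability), and the accept condition uses "greater than or equal to" in line with Equation 1.

```python
from typing import Optional, Sequence


def speaker_score(probs: Sequence[float],
                  weights: Optional[Sequence[float]] = None) -> float:
    """Average the per-frame probabilities P(LU_i) into one speaker score.

    With weights, a weighted mean normalized by the weight sum is used;
    that normalization is an assumption, not taken from the source.
    """
    if not probs:
        raise ValueError("at least one language unit frame is required")
    if weights is None:
        return sum(probs) / len(probs)
    if len(weights) != len(probs):
        raise ValueError("weights and probs must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, probs)) / total


def verify(score: float, threshold: float) -> bool:
    """Accept the speaker when the score reaches the minimum speaker score."""
    return score >= threshold
```

In this sketch, the per-frame probabilities would come from the multi-layer perceptron outputs for the frames that passed the loudness check.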

According to another aspect of the present invention, the above and other objects may also be achieved by the provision of a voice recognition system for voice recognition, comprising a voice analyzer analyzing a spectrum of a sound signal inputted from the outside; a voice extractor extracting a voice signal from the sound signal and extracting predetermined language units for recognizing a speaker from the voice signal; a voice database storing therein background speaker voice data including the loudness of a plurality of reference background speakers; a loudness determiner determining the loudness of each language unit, and determining whether the loudness of each language unit is within a predetermined loudness range on the basis of the voice database; a learner learning the language units in the case that at least the predetermined number of the language units are within the predetermined loudness range; a memory storing data on the learned language units as recognition data for the speaker; and a controller controlling operations of the voice analyzer, the voice extractor, the loudness determiner and the learner when a voice is inputted, and storing the recognition data for the speaker in the memory.

According to another embodiment of the present invention, the above and other objects may also be achieved by the provision of a speaker recognition system for recognizing whether a speaker is a registered speaker, comprising a voice analyzer analyzing a spectrum of a voice signal inputted from external sound signals; a voice extractor picking out voice signals from the inputted sound and extracting predetermined language units for recognizing the speaker from the voice signals; a loudness determiner determining the loudness of each language unit, and determining whether the loudness of each language unit is within a predetermined loudness range; a speaker score calculator calculating a speaker score by calculating the probability that each language unit belongs to the speaker, and by averaging the probabilities; and a controller controlling the speaker score calculator to calculate the speaker score in the case that at least the predetermined number of the language units are within the predetermined loudness range, and ascertaining that the speaker has been registered when the speaker score is beyond a threshold value by comparing the calculated speaker score with the predetermined threshold value, which is a predetermined minimum speaker score for ascertaining the registered speaker.

BRIEF DESCRIPTION OF DRAWINGS

The present invention will be better understood and its various objects and advantages will be more fully appreciated from the following description taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram of a voice recognition system according to the present invention;

FIG. 2 is a graph showing a filter bank of the voice recognition system according to the present invention;

FIG. 3 is a graph showing a rate of middle distance variation between registered speakers according to the filter bank allocation of FIG. 2;

FIG. 4 is a graph showing the variance degree of the registered speakers according to the filter bank allocation of FIG. 2;

FIG. 5 is a flow chart showing the process of picking out an isolated word in the voice recognition system according to the present invention;

FIG. 6 is a flow chart showing the process of registering a voice in the voice recognition system according to the present invention; and

FIG. 7 is a flow chart showing the process of verifying a speaker in the voice recognition system according to the present invention.

MODES FOR CARRYING OUT THE INVENTION

Hereinafter, the present invention will be described in more detail with reference to the accompanying drawings.

In a voice recognition system according to the present invention, an MLP (Multi-Layer Perceptron) for sampling continuants and verifying a speaker is used independently or together with an HMM (Hidden Markov Model) at the time of voice recognition. The advantages of the MLP are that it can learn to reject a competitive group; that preliminary data on statistical characteristics of voice are unnecessary; and that it is easy to embody the MLP in hardware on account of its high degree of parallel computation and regularity.

In the present invention, the MLP is used for verifying a speaker. Hereinbelow, in order to show that the MLP is used in verifying a speaker, a stochastic method for verifying the speaker will be described first, and then it will be described that the operation of the MLP is based on the stochastic method.

In the speaker verification, an utterance is defined as a sample O, which is an observed sequence generated by a voice model M(S) related to a speaker S. The relation between the inputted sample O and the voice model M(S) is expressed as a posteriori probability P(M(S)|O). A verification process V(S) is performed by comparing the posteriori probability P(M(S)|O) with a predetermined threshold value θ.

$V(S) = \begin{cases} \text{reject}, & P(M(S) \mid O) < \theta \\ \text{accept}, & P(M(S) \mid O) \geq \theta \end{cases} \qquad [\text{Equation 1}]$

Equation 1 shows that the speaker is rejected when the posteriori probability is lower than the threshold value θ, and accepted when it is higher than or equal to θ.

Using Bayes' rule, the posteriori probability P(M(S)|O) can be written as follows.

$P(M(S) \mid O) = \frac{P(O \mid M(S))\,P(M(S))}{P(O)} \qquad [\text{Equation 2}]$

Herein, because the speaker to be verified belongs not to a closed set but to an open set, it is impossible to calculate accurately not only the priori probability P(M(S)), which is a fixed value in a closed set, but also P(O), which is the evidence of the speaker.

$P(O) = \sum_{i}^{\infty} P(O \mid M(S_{i}))\,P(M(S_{i})) \qquad [\text{Equation 3}]$

Thus, because P(M(S)) and P(O) are uncertain, P(O|M(S)) cannot be used directly to calculate the posteriori probability.

To solve the above problem, there has been proposed a method which averages P(O|M(S)) through a comparison with other speakers, i.e., a similarity score of the enquiring speaker is averaged relative to the similarity scores of registered (background) speakers. A similarity ratio due to the comparison between the speaker and the background speakers can be expressed as follows.

$L(O) = \frac{P(O \mid M(S_{i}))}{P(O \mid M(S_{bg,\, i \notin bg}))} \qquad [\text{Equation 4}]$

where L(O) is the similarity ratio, P(O|M(S_i)) is the likelihood of the enquiring speaker, and P(O|M(S_bg)) is the likelihood of the background speakers.

Using the above method, the posteriori probability P(M(S)|O) is estimated by approximately calculating Equation 3 when the background speaker set is sufficiently large to represent every enquiring speaker.

On the other hand, according to research by Gish, the MLP embodies the above mathematical model.

Assume that the MLP is a function f(x,θ), in which x is an input feature vector and θ is a parameter set defining the MLP. Let a be the target output when x belongs to the set C_enr of the enquiring speaker, and b the target output when x belongs to the set C_bg of the background speakers. A reference for estimating the efficiency of the MLP can be expressed as follows with an average squared error.

$E = \frac{1}{N}\left[\sum_{x \in C_{enr}} [f(x,\theta) - a]^{2} + \sum_{x \in C_{bg}} [f(x,\theta) - b]^{2}\right] \qquad [\text{Equation 5}]$

where N is the number of samples for learning.

Thus, if N is sufficiently large, and the number of samples of both speaker sets is given by the priori probability of the set distribution, then the above summation can be approximated as follows.

$E \approx \int [f(x,\theta) - a]^{2}\, p(x, C_{enr})\,dx + \int [f(x,\theta) - b]^{2}\, p(x, C_{bg})\,dx \qquad [\text{Equation 6}]$

where p(x,C) is the joint probability density function of an observation and an observed speaker set.

$p(x) = p(x, C_{enr}) + p(x, C_{bg}), \qquad d(x) = \frac{a\,p(x, C_{enr}) + b\,p(x, C_{bg})}{p(x)} = a\,P(C_{enr} \mid x) + b\,P(C_{bg} \mid x) \qquad [\text{Equation 7}]$

Using Equation 7, Equation 6 leads to

$E = \int [f(x,\theta) - d(x)]^{2}\, p(x)\,dx + a^{2} P(C_{enr}) + b^{2} P(C_{bg}) - \int d^{2}(x)\, p(x)\,dx \qquad [\text{Equation 8}]$

In Equation 8, only the first term includes the parameters related to the MLP. Therefore, varying the parameters of f(x,θ) in order to minimize E is the same as minimizing the average squared error between the output of the MLP and the target probability d(x).
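For readers who want the intermediate step, the passage from Equation 6 to Equation 8 can be sketched as follows. This is a reconstruction under the definitions of Equation 7, writing f for f(x,θ) and using d(x)p(x) = a p(x,C_enr) + b p(x,C_bg) together with the marginalization ∫p(x,C)dx = P(C); it is not quoted from the source.

```latex
\begin{aligned}
E &\approx \int f^{2}\,\bigl[p(x,C_{enr}) + p(x,C_{bg})\bigr]\,dx
   - 2\int f\,\bigl[a\,p(x,C_{enr}) + b\,p(x,C_{bg})\bigr]\,dx
   + a^{2}P(C_{enr}) + b^{2}P(C_{bg}) \\
  &= \int \bigl[f^{2} - 2 f\,d(x)\bigr]\,p(x)\,dx
   + a^{2}P(C_{enr}) + b^{2}P(C_{bg}) \\
  &= \int \bigl[f - d(x)\bigr]^{2} p(x)\,dx
   - \int d^{2}(x)\,p(x)\,dx
   + a^{2}P(C_{enr}) + b^{2}P(C_{bg})
\end{aligned}
```

The last line completes the square in f, which is why only the first integral depends on the MLP parameters.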

In learning, if the vector [0 1] or [1 0] is substituted for a and b as the target output of the MLP, Equation 7 becomes Equation 9. This means that the posteriori probability of one of the two speaker sets is selected as the target output of the MLP.

$d(x) = P(C_{enr} \mid x) \ \text{or} \ P(C_{bg} \mid x) \qquad [\text{Equation 9}]$

That is, according to Equation 8, the MLP learns to approach the selected posteriori probability on the basis of the average squared error. For this to hold, the average squared error must be lowered, and to lower the average squared error, the MLP must have a proper structure.

Hereinbelow, it will be shown that the operation of the MLP includes the process of leveling off the posteriori probability. The output of the MLP is expressed as follows with a sigmoid function.

$f(x,\theta) = \frac{1}{1 + e^{-z(x,\theta)}} \qquad [\text{Equation 10}]$

where z(x,θ) is the input of the sigmoid function in the output layer.

The inverse function of Equation 10 can be expressed as follows.

$z(x,\theta) = \log\frac{f(x,\theta)}{1 - f(x,\theta)} \qquad [\text{Equation 11}]$

Further, for the enquiring speaker, if the output of the MLP is defined as the posteriori probability,

$f(x,\theta) = P(C_{enr} \mid x) \qquad [\text{Equation 12}]$

then Equation 11 can be rewritten as follows.

$\begin{aligned} z(x,\theta) &= \log\frac{P(C_{enr} \mid x)}{1 - P(C_{enr} \mid x)} \\ &= \log\frac{P(C_{enr} \mid x)}{P(C_{bg} \mid x)} \\ &= \log\frac{P(x \mid C_{enr})}{P(x \mid C_{bg})} + \log\frac{P(C_{enr})}{P(C_{bg})} \end{aligned} \qquad [\text{Equation 13}]$

As a result, the similarity ratio of Equation 4 can be expressed through the MLP. That is, because the similarity ratio can be applied in the MLP, the posteriori probability P(M(S)|O) can be estimated by approximating Equation 3. Therefore, using the posteriori probability, speaker verification is possible in the open set through the MLP with the similarity ratio.
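The relation in Equations 10 to 13 can be checked numerically. The following Python sketch uses hypothetical likelihood and prior values (they are not from the source) to show that applying the sigmoid to the log likelihood ratio plus the log prior ratio reproduces the posteriori probability obtained directly from Bayes' rule for two classes.

```python
import math


def sigmoid(z: float) -> float:
    """Equation 10: output of the sigmoid unit for input z."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(f: float) -> float:
    """Equation 11: inverse of the sigmoid."""
    return math.log(f / (1.0 - f))


def log_odds(lik_enr: float, lik_bg: float,
             prior_enr: float, prior_bg: float) -> float:
    """Last line of Equation 13: log likelihood ratio plus log prior ratio."""
    return math.log(lik_enr / lik_bg) + math.log(prior_enr / prior_bg)


def posterior_enr(lik_enr: float, lik_bg: float,
                  prior_enr: float, prior_bg: float) -> float:
    """P(C_enr | x) from Bayes' rule with only two classes (enr and bg)."""
    joint_enr = lik_enr * prior_enr
    joint_bg = lik_bg * prior_bg
    return joint_enr / (joint_enr + joint_bg)


# Hypothetical values chosen only for illustration.
z = log_odds(0.6, 0.2, 0.5, 0.5)
p = posterior_enr(0.6, 0.2, 0.5, 0.5)
```

With these values, `sigmoid(z)` and `p` agree up to floating-point rounding, and `logit(p)` recovers `z`.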

On the other hand, the voice recognition system according to the present invention employing the MLP will be described hereinbelow.

As shown in FIG. 1, the voice recognition system 1 according to the present invention comprises a learning part 5 for learning preceding the speaker registration, a speaker verification part 7 for verifying the speaker, and an analysis part 3 commonly used for the speaker registration and verification.

The analysis part 3 includes a voice analyzer 11 for analyzing a voice signal of a speaker, a voice extractor 13 extracting a voice signal from the inputted sound and extracting predetermined language units for recognizing the speaker, and a loudness determiner 15 determining the loudness of each language unit and determining whether the loudness of each language unit is within a predetermined loudness range.

The learning part 5 includes a learner 23 learning the language units in the case that some of the language units are within the predetermined loudness range, a memory 25 storing data on the learned language units for the speaker recognition, and a voice database 21 in which the loudness and voice characteristics of the background speakers to be compared with the enquiring speaker are stored.

The speaker verification part 7 includes a speaker score calculator 31 calculating, through the MLP, the probability that each language unit belongs to the speaker in the case that some of the language units are within the predetermined loudness range and then calculating the speaker score as the average of the probabilities, and a controller 33 comparing the calculated speaker score with the predetermined threshold value and verifying that the speaker has been registered when the speaker score is beyond the threshold value.

However, because voice signals are nonlinear, speaker recognition efficiency is not perfect. The speaker recognition rate according to resonance frequency bands of the voice signal has been measured by Cristea et al. According to the measurement, in the case of voice recognition for understanding the meaning of the voice, the recognition rate was greater than 80% at 0.3 KHz˜2 KHz, whereas in speaker recognition for identifying to whom the voice belongs, the recognition rate was greater than 80% at 1.5 KHz˜2.7 KHz. Based on this result, Cristea et al. improved the speaker recognition rate by narrowing the filter banks at 1.5 KHz˜2.5 KHz in comparison with 0˜1.5 KHz.

As shown in FIG. 2, according to the present invention, in compressing the spectrum, the intervals of the filter banks are uniform at 0˜3 KHz, whereas the intervals of the filter banks increase logarithmically over 3 KHz. At this time, two-thirds of the fifty filter banks, i.e., about thirty-three, are allocated to 0˜3 KHz, and the other filter banks are logarithmically allocated over 3 KHz.
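A minimal Python sketch of such an allocation is given below. It computes band edges only, places 33 of 50 banks uniformly up to 3 KHz, and spaces the remaining edges geometrically above 3 KHz. The upper limit of 8000 Hz, the geometric spacing as the interpretation of "logarithmically", and the function name `filter_bank_edges` are assumptions made for illustration, not values stated in the text.

```python
from typing import List


def filter_bank_edges(n_banks: int = 50,
                      n_uniform: int = 33,
                      split_hz: float = 3000.0,
                      max_hz: float = 8000.0) -> List[float]:
    """Return n_banks + 1 band edges in Hz.

    Edges are equally spaced from 0 to split_hz, then geometrically
    spaced from split_hz to max_hz (max_hz is an assumed upper limit).
    """
    if not 0 < n_uniform < n_banks:
        raise ValueError("n_uniform must be between 1 and n_banks - 1")
    if not 0.0 < split_hz < max_hz:
        raise ValueError("require 0 < split_hz < max_hz")
    # Uniform region: n_uniform bands of equal width.
    uniform = [split_hz * k / n_uniform for k in range(n_uniform + 1)]
    # Logarithmic region: each edge is a constant ratio above the previous.
    n_log = n_banks - n_uniform
    ratio = (max_hz / split_hz) ** (1.0 / n_log)
    log_part = [split_hz * ratio ** k for k in range(1, n_log + 1)]
    return uniform + log_part
```

With the default arguments, the bands below 3 KHz are each about 91 Hz wide, and the band widths above 3 KHz grow with frequency.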

The present inventor ascertained that the above filter bank allocation method is more efficient than the method of Cristea et al. in terms of speaker recognition efficiency. Hereinbelow, this will be demonstrated by the middle distance between speakers expressed as Equation 1-1, and the degree of variance between speaker sets expressed as Equation 1-2.

$\mathrm{Dist}_{i,j} = |m_{i} - m_{j}| \qquad [\text{Equation 1-1}]$

$J_{i,j} = \frac{\mathrm{Dist}_{i,j}^{2}}{\mathrm{Var}_{i} + \mathrm{Var}_{j}} \qquad [\text{Equation 1-2}]$

According to the middle distance between speakers and the degree of variance between speaker sets, which are respectively derived from Equations 1-1 and 1-2, when the filter banks are allocated according to the present invention rather than by the method of Cristea et al., the middle distance between speakers of each language unit is, as shown in FIG. 3, 20.7% greater on average, and the degree of variance between speaker sets of each language unit is, as shown in FIG. 4, 6.3% lower on average. Generally, the classification efficiency of the language unit increases as the middle distance between speakers increases and as the degree of variance of the speaker set decreases, and thus, as shown in FIGS. 3 and 4, both the middle distance between speakers and the degree of variance of the speaker set are improved according to the present invention.
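Equations 1-1 and 1-2 can be evaluated with a few lines of Python. The text does not define m_i or Var_i in detail, so this sketch assumes that each speaker is represented by a list of scalar feature values for one language unit, that m_i is their mean, and that Var_i is their population variance; the function names are hypothetical.

```python
from statistics import fmean, pvariance
from typing import Sequence


def middle_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Equation 1-1: absolute difference of the two speakers' means."""
    return abs(fmean(a) - fmean(b))


def variance_degree(a: Sequence[float], b: Sequence[float]) -> float:
    """Equation 1-2: squared distance divided by the summed variances."""
    return middle_distance(a, b) ** 2 / (pvariance(a) + pvariance(b))


# Hypothetical feature values for two speakers and one language unit.
speaker_i = [1.0, 2.0, 3.0]
speaker_j = [4.0, 5.0, 6.0]
dist = middle_distance(speaker_i, speaker_j)
j_value = variance_degree(speaker_i, speaker_j)
```

For these values the means are 2 and 5 and each population variance is 2/3, so `dist` is 3 and `j_value` is 9 divided by 4/3.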

As described above, in the voice recognition system according to the present invention, the voice analyzer 11 compresses the spectrum with the intervals of the filter banks uniform at 0˜3 KHz and logarithmically increasing over 3 KHz. Further, the voice analyzer 11 divides the inputted voice signal into predetermined frames before compressing the spectrum, and then extracts the spectrum of each frame.

According to the present invention, the language units are selected from nasals, vowels, and approximants, which include a relatively large amount of continuous sound, and thus a total of nine phonemes, /a/, /e/, /v/, /o/, /u/, /eu/, /i/, /liq/, /nas/, are employed as the language units. Hereinafter, the above language units having a large amount of continuous sound will be called continuants.

The voice extractor 13 extracts mutes, the continuants, and voiceless sound from the compressed spectrum, and detects an isolated word. The isolated word is the unit of language necessary for the speaker recognition, e.g., a phrase, a word, a syllable, a phoneme, etc. The voice extractor 13 classifies the frames detected by the voice analyzer 11 into eleven types, namely the mute, the nine continuants, and the voiceless sound, through a TDNN (Time-Delay Neural Network), and then applies the result from the TDNN and the energy of each frame to an algorithm for detecting the isolated word. Herein, the TDNN additionally includes a time-delay term in comparison with the MLP.

Hereinbelow, the process of detecting the isolated word will be described with reference to FIG. 5.

First, if sound begins (A10), it is determined whether the sound duration exceeds a MinSD (Minimum Sound Duration) (A20). The MinSD is employed as a reference for detecting the isolated word. If the sound duration does not exceed the MinSD, the beginning of the utterance is redetected, whereas if the sound duration exceeds the MinSD, it is detected whether non-sound begins (A30). At this time, if non-sound begins, it is determined whether the non-sound duration exceeds a MaxNSD (Maximum Non-Sound Duration) (A40). If the non-sound duration exceeds the MaxNSD, the process for detecting the isolated word is stopped.
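The steps A10 to A40 can be sketched as a small state machine in Python. Several details are assumptions made for illustration: the input is a per-frame sound/non-sound flag (in the described system this decision would derive from the TDNN result and the frame energy), durations are counted in frames, the word is taken to end at the last sound frame, and `None` is returned if the input ends before the closing silence. The name `detect_isolated_word` is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def detect_isolated_word(is_sound: Sequence[bool],
                         min_sd: int,
                         max_nsd: int) -> Optional[Tuple[int, int]]:
    """Return (first_frame, last_sound_frame) of the isolated word, or None."""
    start: Optional[int] = None  # first frame of the current candidate
    last_sound = 0               # most recent sound frame of the candidate
    sound_run = 0                # sound frames counted since start
    silence_run = 0              # consecutive non-sound frames
    for i, sound in enumerate(is_sound):
        if start is None:
            if sound:            # A10: sound begins
                start, last_sound = i, i
                sound_run, silence_run = 1, 0
            continue
        if sound:
            sound_run += 1
            silence_run = 0
            last_sound = i
            continue
        if sound_run <= min_sd:  # A20: too short, redetect the beginning
            start = None
            continue
        silence_run += 1         # A30: non-sound has begun
        if silence_run > max_nsd:  # A40: silence long enough, stop
            return (start, last_sound)
    return None
```

For example, with `min_sd=2` and `max_nsd=2`, a single isolated sound frame is discarded as too short, and a short pause inside a longer sound run does not end the word.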

Once the isolated word is detected, the frames including the continuants can be extracted from the isolated word through the TDNN. One frame may include only one continuant or a plurality of continuants. Thereafter, the frames including the continuants are reanalyzed, and thus can be used as speaker patterns of each continuant for the speaker recognition and the speaker verification.

On the other hand, once a frame including the continuants is extracted, the loudness determiner 15 calculates an energy value of the continuant spectrum, and determines the loudness thereof. Further, the loudness determiner 15 determines whether the loudness of the enquiring speaker can be used in the speaker registration by comparing it with the loudness of the background speakers previously stored in the voice database 21.

The voice database 21 is a collection of data on the voices of a large number of background speakers to be compared with the enquiring speakers, in which the maximum and minimum loudness of each continuant of the background speakers are previously stored. At this time, the loudness of every continuant of each background speaker can be calculated from the energy value of every continuant, and expressed as follows.

$\mathrm{Loud}(p, n) = \frac{1}{M}\sum_{i=0}^{M-1} S(M \cdot n + i) \qquad [\text{Equation 1-3}]$

where S is a voice sample, p is a continuant, M is the number of voice samples in the frame, and n is a frame number.

Using Equation 1-3, it is determined whether the loudness of the enquiring speaker's frame including the continuants is between the minimum and maximum loudness of the background speakers. At this time, the frames including the continuants can be registered by two methods. In the first, regardless of the total number of frames of the isolated word extracted from the enquiring speaker's voice, only the frames whose loudness is between the minimum and maximum loudness of the background speakers are admitted for registration. In the second, the frames are admitted for registration if the frames whose loudness is between the minimum and maximum loudness of the background speakers account for more than the predetermined percentage. Generally, because the continuants of a long word are uttered with different loudness according to accent and grammar, the latter method, which considers the average loudness of all frames of the isolated word, is desirable.
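The loudness measure and the second (percentage-based) admission method can be sketched in Python as follows. The sketch follows Equation 1-3 as printed and therefore assumes that the sample values S are already non-negative magnitudes; the function names and the use of "greater than or equal to" for the predetermined rate are likewise assumptions.

```python
from typing import Sequence


def frame_loudness(samples: Sequence[float], n: int, m: int) -> float:
    """Equation 1-3: mean of S(M*n + i) for i = 0..M-1 in frame n."""
    frame = samples[m * n: m * n + m]
    if len(frame) != m:
        raise ValueError("frame index out of range")
    return sum(frame) / m


def in_range_ratio(loudness: Sequence[float],
                   min_loud: float, max_loud: float) -> float:
    """Fraction of frames whose loudness lies within [min_loud, max_loud]."""
    if not loudness:
        raise ValueError("at least one frame is required")
    inside = sum(1 for v in loudness if min_loud <= v <= max_loud)
    return inside / len(loudness)


def admit_for_registration(loudness: Sequence[float],
                           min_loud: float, max_loud: float,
                           min_ratio: float) -> bool:
    """Admit the frames when enough of them fall within the loudness range."""
    return in_range_ratio(loudness, min_loud, max_loud) >= min_ratio
```

Here `min_loud` and `max_loud` stand for the minimum and maximum loudness of the background speakers for the continuant in question, as stored in the voice database 21.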

On the other hand, the voice database 21 used in the present inventionis jointly researched for an efficiency test by Korea Institute ofTechnology and Kwangwoon University. The voice database 21 has anutterance catalog including a single numeral, a demonstrative word, afour series numeral, a short sentence, and a PBW (phone-balanced word).According to the present invention, the PBW and the four series numeralare used in the TDNN for recognizing the continuants and the MLP forverifying the speakers, respectively.

In the case that the frame including the continuants is admitted to beregistered by determining the loudness, the voice extractor 13 forms aplurality of speaker patterns corresponding to each language unit of aspeaker. The speaker patterns corresponding to each language unit of thebackground speakers are previously stored in the voice database 21.

To register a speaker, a template for a registration word correspondingto the isolated word is formed and stored, and the learning according tothe continuants is performed by the MLP. In order to store theregistration word by a template as a unit, 2-3 templates must be neededfor one word. Thus, the enquiring speaker must utter the same wordseveral times at the time of the speaker registration.

In the conventional learning of the continuants for the speaker registration, the enquiring speaker patterns are learned against every background speaker pattern, and one such pass is called an epoch. In the case of learning a reference pattern by one epoch, because a learning suspension reference is applied to every background speaker, the discrimination rate between the enquiring speaker and a background speaker having a pattern similar to the enquiring speaker is lowered. Herein, the learning suspension reference is a predetermined priori changing rate. The predetermined priori changing rate is an average squared error rate employed as a reference for determining whether or not the learning through the MLP is sufficient, and is determined by experimentation. The average squared error rate expresses the range of error occurring between the background speakers.

That is, if the average squared error rate approaches the predetermined priori changing rate in the course of learning by comparing the enquiring speaker with the background speaker, the learner 23 suspends the learning. However, because the priori changing rate is only an experimental value, it is possible that a background speaker has an error range smaller than the priori changing rate. Therefore, when the range of error occurring between the background speaker and the enquiring speaker is smaller than the priori changing rate, the verifiability is lowered, thereby increasing the false acceptance (FA) rate. The false acceptance rate expresses the rate of falsely accepting a non-registered speaker. If a system accepts a non-registered speaker, information in the system is likely to be leaked by an impostor, so the false acceptance rate must be decreased.

According to the present invention, in order to correctly learn the speaker characteristics, the plurality of reference patterns formed for each continuant of one background speaker is employed as one learning group. Thus, because each continuant forms a learning group, every background speaker has a plurality of learning groups. That is, in the case that one background speaker has nine continuants and each continuant has ten reference patterns, that background speaker has nine learning groups, each including ten reference patterns.
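The grouping described above can be sketched as follows. The continuant labels and the placeholder pattern vectors are hypothetical; only the nine-by-ten shape comes from the example in the text.

```python
def build_learning_groups(reference_patterns):
    """Group the reference patterns of ONE background speaker by continuant.

    reference_patterns: iterable of (continuant_label, pattern) pairs.
    Returns a dict mapping each continuant to its list of patterns,
    i.e. one learning group per continuant.
    """
    groups = {}
    for continuant, pattern in reference_patterns:
        groups.setdefault(continuant, []).append(pattern)
    return groups


# Example from the text: nine continuants, ten reference patterns each.
continuants = ["c%d" % i for i in range(9)]      # hypothetical labels
references = [(c, [float(k)])                    # placeholder feature vectors
              for c in continuants for k in range(10)]
groups = build_learning_groups(references)
```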

Using the MLP, the learner 23 compares the reference patterns of every background speaker with the plurality of patterns of the enquiring speaker, and learns the pattern property of the enquiring speaker according to a back-propagation algorithm. Herein, because one pass of learning that compares the reference patterns of every background speaker with the plurality of patterns of the enquiring speaker was called an epoch, one pass of learning that compares one of the learning groups of a background speaker with one of the patterns of the enquiring speaker will be called a sub-epoch.

Therefore, the pattern of the enquiring speaker goes through a plurality of sub-epochs against the reference patterns of the background speaker. Through the plurality of sub-epochs, the reference patterns of every background speaker are compared with the patterns of the enquiring speaker. At this time, the more similar the reference pattern of a background speaker is to the patterns of the enquiring speaker, the more the learning is repeated. Thus, the discrimination of the pattern between the enquiring speaker and the background speaker is increased.
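The effect of repeating the learning per learning group can be illustrated with a scheduling sketch. It assumes that the suspension criterion is checked separately for each learning group, which is an interpretation of the text, and it replaces the actual MLP back-propagation pass with a stand-in callable; `run_sub_epochs`, `halve`, the error values, and `stop_error` are all hypothetical.

```python
def run_sub_epochs(initial_errors, train_step, stop_error, max_steps=1000):
    """Repeat sub-epochs for each learning group until its error falls
    below stop_error (or max_steps is reached).

    initial_errors: dict mapping a learning-group name to its current error.
    train_step: callable(error) -> error after one sub-epoch; a stand-in
                for one back-propagation pass over that learning group.
    Returns a dict mapping each group to the number of sub-epochs it needed.
    """
    counts = {}
    for group, error in initial_errors.items():
        steps = 0
        while error >= stop_error and steps < max_steps:
            error = train_step(error)
            steps += 1
        counts[group] = steps
    return counts


def halve(error):
    """Toy training step: each sub-epoch halves the remaining error."""
    return error * 0.5


# A background speaker similar to the enquiring speaker starts with a
# larger error and therefore receives more sub-epochs.
counts = run_sub_epochs(
    {"similar_speaker": 0.8, "dissimilar_speaker": 0.1},
    halve,
    stop_error=0.06,
)
```

In this toy run the similar speaker's group is trained more often than the dissimilar one, which is the behaviour the paragraph above attributes to sub-epoch learning.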

Thereafter, the learned patterns are stored in the memory 25, and used as a reference value when the voice of the enquiring speaker is re-inputted.

The process of verifying a speaker will be described herein below. If the enquiring speaker inputs his/her voice, the loudness determiner 15 determines whether or not at least the predetermined number of the language units of the isolated word are within the predetermined loudness range. If the inputted isolated word is not within the predetermined loudness range, the enquiring speaker is requested to re-input his/her voice. Conversely, if the inputted isolated word is within the predetermined loudness range, it is determined through a DTW (Dynamic Time Warping) algorithm whether or not the isolated word and the registration word template are identical to each other. Then, in the case that the inputted isolated word and the stored registration word template are identical to each other, the speaker score is calculated by inputting the extracted continuants to the MLP holding the learned speaker pattern. The speaker score is derived from Equation 1-4.
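For readers unfamiliar with DTW, the following is a minimal sketch of the classic dynamic-programming formulation. It is not the patent's implementation: it works on one-dimensional sequences with an absolute-difference cost, whereas a real system would compare per-frame feature vectors, and `matches_template` with its `max_distance` parameter is a hypothetical acceptance rule.

```python
def dtw_distance(a, b):
    """Classic DTW alignment cost between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def matches_template(word, templates, max_distance):
    """Treat the word as identical to the registration word when its best
    DTW cost against any stored template is within max_distance."""
    return min(dtw_distance(word, t) for t in templates) <= max_distance
```

Because DTW allows one sequence to be stretched against the other, the same word uttered more slowly (for example `[1, 2, 2, 3]` against `[1, 2, 3]`) still aligns at zero cost.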

$\begin{matrix}{\mathrm{Score}_{speaker} = \frac{1}{M}\sum\limits_{i = 0}^{M - 1} P(LU_{i})} & \lbrack\text{Equation 1-4}\rbrack\end{matrix}$

where P(LU_i) is a probability score that the enquiring speaker is the background speaker for the i-th language unit frame, and M is the number of language unit frames extracted from the isolated word.

The speaker score may also be calculated by putting a weighted value on the continuants having good discrimination.
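Equation 1-4 and the weighted variant can be sketched as below. The plain average follows Equation 1-4 directly. The text does not specify how the weights are applied, so the normalized weighted average shown here is an assumption.

```python
def speaker_score(probabilities):
    """Equation 1-4: mean of the per-frame probabilities P(LU_i)."""
    if not probabilities:
        raise ValueError("at least one language unit frame is required")
    return sum(probabilities) / len(probabilities)


def weighted_speaker_score(probabilities, weights):
    """Assumed weighted variant: frames of continuants with better
    discrimination receive larger weights; the result is normalized."""
    if len(probabilities) != len(weights):
        raise ValueError("one weight per language unit frame is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probabilities, weights)) / total
```

For example, the probabilities 0.9, 0.8 and 0.7 for three frames give a plain speaker score of about 0.8, and equal weights reduce the weighted variant to Equation 1-4.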

Thereafter, the calculated speaker score is compared to the predetermined threshold value, and if the calculated speaker score is beyond the threshold value, the inputted voice is determined to be the voice of a registered speaker and is accepted. Herein, the threshold value is the minimum speaker score for verifying that the inputted voice is the voice of the registered speaker, and is determined as a value that minimizes only the false rejection (FR) rate, because the verification of the registration word is not important in the speaker verification. The false rejection rate expresses the rate of falsely rejecting the registered speaker.

With this configuration, the process of registering a voice in the voice recognition system 1 according to the present invention will be described herein below with reference to FIG. 6.

First, if the enquiring speaker inputs his/her voice (S10), the voice analyzer 11 divides the inputted voice signal into predetermined frames (S20), represents it as a spectrum (S30), and compresses the spectrum through the filter bank, thereby picking out the isolated word (S40). Then, the voice extractor 13 picks out the frames including the language unit among the frames of the isolated word (S50). The loudness determiner 15 determines the loudness of the language unit (S60), and determines whether or not the loudness is between the maximum and minimum loudness of the background speakers (S70). At this time, if the loudness of the enquiring speaker's language unit is not between the maximum and minimum loudness of the background speakers, the controller 33 requests the enquiring speaker to re-input his/her voice (S75).

Conversely, if the loudness of the enquiring speaker's language unit is between the maximum and minimum loudness of the background speakers, the pattern of every language unit of the enquiring speaker is made (S80). Further, the learner 23 compares the reference patterns of every background speaker with the patterns of the enquiring speaker, and learns the pattern property of the enquiring speaker with the MLP (S90). Herein, the reference patterns of the background speakers are classified into the plurality of learning groups according to each language unit, and each pattern of the enquiring speaker is compared to the reference patterns of the background speakers according to the language unit. Then, when the learning is completed, the compared patterns and the loudness of the enquiring speaker are registered (S100).

Further, the process of speaker verification, which verifies whether or not the voice of the enquiring speaker is a registered voice, will be described herein below with reference to FIG. 7.

First, if the enquiring speaker inputs his/her voice (P10), the voice analyzer 11 divides the inputted voice signal into predetermined frames (P20), represents it as a spectrum (P30), and picks out the isolated word (P40). Then, the voice extractor 13 picks out the frames including the language unit among the frames of the isolated word (P50). The loudness determiner 15 determines the loudness of the language unit (P60), and determines whether or not the loudness is between the maximum and minimum loudness of the background speakers (P70). At this time, if the loudness of the enquiring speaker's language unit is not between the maximum and minimum loudness of the background speakers, the controller 33 requests the enquiring speaker to re-input his/her voice (P75), whereas, if the loudness of each language unit of the enquiring speaker is between the maximum and minimum loudness of the background speakers, the speaker score calculator 31 calculates the speaker score of every language unit with the MLP (P80). Thereafter, the controller 33 compares the calculated speaker score with the predetermined threshold value (P90), and verifies that the enquiring speaker has been registered when the speaker score is beyond the threshold value, thereby accepting the enquiring speaker (P100). Conversely, in the case that the speaker score is below the threshold value, the controller 33 verifies that the enquiring speaker has not been registered, thereby rejecting the enquiring speaker (P105).
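The decision steps P60 through P105 can be summarized in a short sketch. It assumes that the per-frame loudness values and the per-frame MLP probabilities are already available; the function name, the `min_fraction` parameter, the string return values, and the treatment of a score exactly equal to the threshold as a rejection are illustrative assumptions.

```python
def verify_speaker(frame_loudness, frame_probabilities,
                   min_loudness, max_loudness, min_fraction, threshold):
    """Return "re-input", "accept" or "reject" for one utterance."""
    if not frame_loudness or len(frame_loudness) != len(frame_probabilities):
        return "re-input"
    # P60-P70: count the frames whose loudness is inside the range.
    inside = sum(1 for x in frame_loudness
                 if min_loudness <= x <= max_loudness)
    if inside / len(frame_loudness) < min_fraction:
        return "re-input"                      # P75
    # P80: speaker score as the mean per-frame probability (Equation 1-4).
    score = sum(frame_probabilities) / len(frame_probabilities)
    # P90: compare with the threshold; P100 accept, P105 reject.
    return "accept" if score > threshold else "reject"
```

A caller would loop on the "re-input" result, prompting the enquiring speaker to utter the word again before any score is computed.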

Hereinbelow, Tables 4 through 6 show the results of speaker verification using the voice recognition system 1 according to the present invention, in which the enquiring speaker utters with 180%, 140%, 120%, 100%, and 80% loudness after being registered with 180%, 140%, 120%, 100%, and 80% loudness, respectively. Tables 1 through 3 show the false reject, the false acceptance, and the isolated word acceptance in the conventional voice recognition system, respectively. Tables 4 through 6 show the false reject, the false acceptance, and the isolated word acceptance according to the embodiments of the present invention. Herein, the false reject expresses the rate of falsely rejecting the registered speaker, and the isolated word acceptance expresses the rate of acceptance when the enquiring speaker utters the registration word of the background speaker.

TABLE 1 False reject in the conventional voice recognition system.
Learning (rows) / Verification (columns)
          180%    140%    120%    100%     80%
180%      0.78    7.67   24.02   71.23   47.29
140%      1.28    1.79    4.67   34.71   19.24
120%      3.58    2.21    2.80   17.53   12.67
100%     30.91   20.86   14.23    2.79   34.59
80%      11.64    8.49    9.95   34.74    3.59

TABLE 2 False acceptance in the conventional voice recognition system.
Learning (rows) / Verification (columns)
          180%    140%    120%    100%     80%
180%     25.17   12.47    7.38    2.41    6.99
140%     19.19   12.11    8.82    3.39    8.26
120%     14.13   10.26    8.26    3.95    7.34
100%      3.91    2.97    2.79    2.79    2.49
80%      14.45   10.61    8.17    3.47   12.91

TABLE 3 Isolated word acceptance in the conventional voice recognition system.
Learning (rows) / Verification (columns)
          180%    140%    120%    100%     80%
180%     99.73   99.52   99.19   99.45   97.19
140%     99.70   99.71   99.38   99.73   99.67
120%     99.62   99.67   99.34   99.71   97.67
100%     99.40   99.55   99.26   99.68   97.53
80%      98.89   99.02   98.82   99.14   96.86

TABLE 4 False reject in the voice recognition system according to the present invention.
Learning (rows) / Verification (columns)
          180%    140%    120%    100%     80%
180%      1.22   10.19   31.08   74.07   48.30
140%      2.23    2.25    6.16   37.37   18.14
120%      4.07    2.65    3.29    2.97   10.64
100%     34.04   22.59   16.05    2.70   32.27
80%      11.57    7.84    9.20   33.44    3.26

As shown in Table 4, the more similar the enquiring loudness is to the learning loudness, the lower the false reject, whereas the more the enquiring loudness differs from the learning loudness, the higher the false reject. In particular, the false reject is at its maximum in the case that the learning loudness is higher and the enquiring loudness is lower.

TABLE 5 False acceptance in the voice recognition system according to the present invention.
Learning (rows) / Verification (columns)
          180%    140%    120%    100%     80%
180%     23.16   12.08    7.35    2.41    7.26
140%     17.08   11.58    8.23    3.19    8.98
120%     12.72    9.72    7.72    8.46    8.15
100%      3.35    2.75    2.56    2.71    2.61
80%      13.38   10.05    7.66    3.26   11.85

As shown in Table 5, the false acceptance, which expresses the rate of falsely accepting the non-registered speaker, is minimized in the case of 100% learning or enquiring loudness. In the other cases, the false acceptance is increased. Compared with Table 2, Table 5 shows that the false acceptance of the voice recognition system 1 according to the present invention is improved on the whole.
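One way to check the "on the whole" comparison is to average all 25 cells of each false acceptance table. The sketch below copies the values of Table 2 and Table 5; using the unweighted mean over all cells as the summary statistic is an assumption made for illustration, not a measure stated in the disclosure.

```python
# False acceptance values copied from Table 2 (conventional system) and
# Table 5 (present invention); rows are learning loudness, columns are
# verification loudness, both in the order 180%, 140%, 120%, 100%, 80%.
TABLE_2 = [
    [25.17, 12.47, 7.38, 2.41, 6.99],
    [19.19, 12.11, 8.82, 3.39, 8.26],
    [14.13, 10.26, 8.26, 3.95, 7.34],
    [3.91, 2.97, 2.79, 2.79, 2.49],
    [14.45, 10.61, 8.17, 3.47, 12.91],
]
TABLE_5 = [
    [23.16, 12.08, 7.35, 2.41, 7.26],
    [17.08, 11.58, 8.23, 3.19, 8.98],
    [12.72, 9.72, 7.72, 8.46, 8.15],
    [3.35, 2.75, 2.56, 2.71, 2.61],
    [13.38, 10.05, 7.66, 3.26, 11.85],
]


def mean_rate(table):
    """Unweighted mean over every cell of a table."""
    cells = [value for row in table for value in row]
    return sum(cells) / len(cells)


mean_conventional = mean_rate(TABLE_2)
mean_invention = mean_rate(TABLE_5)
```

The mean falls from about 8.59 in Table 2 to about 8.33 in Table 5, although individual cells do not all improve (the 120% learning, 100% verification cell rises from 3.95 to 8.46).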

TABLE 6 Isolated word acceptance in the voice recognition system according to the present invention.
Learning (rows) / Verification (columns)
          180%    140%    120%    100%     80%
180%     46.41   56.89   59.32   60.44   58.68
140%     57.28   75.91   80.43   82.89   81.09
120%     60.04   81.66   87.70   87.70   89.42
100%     60.54   82.91   90.33   94.63   93.04
80%      59.19   80.77   88.10   92.33   90.92

As shown in Table 6, the isolated word acceptance is minimized in the case of 180% learning and enquiring loudness. Compared with Table 3, Table 6 shows that the isolated word acceptance of the voice recognition system 1 according to the present invention is decreased on the whole. Thus, the registered speaker can be verified most correctly by allowing the enquiring speaker to utter again when the enquiring speaker utters with unsuitable loudness.

As described above, in the voice recognition system 1 according to the present invention, it is determined at the time of learning the voice whether the voice of the enquiring speaker is within the predetermined loudness range of the background speakers, and only the voice within the predetermined loudness range is analyzed, thereby forming the speaker pattern. Further, it is determined at the time of speaker verification whether the voice of the enquiring speaker is within the predetermined loudness range of the background speakers, and the speaker scores of only the voices within the predetermined loudness range are calculated, thereby rejecting or accepting the enquiring speaker.

In the voice recognition system 1 according to the present invention, the recognition is most efficient with 100% loudness at both the learning and the verification, whereas the more the loudness differs from 100%, the more the recognition efficiency is decreased. That is, in the conventional voice recognition system the recognition efficiency and the loudness are not correlated, but in the voice recognition system according to the present invention the isolated word acceptance is decreased in proportion to a rise in the loudness difference between the enquiring and learning utterances, thereby making the enquiring speaker re-utter. Thus, the false acceptance expressing the rate of falsely accepting the non-registered speaker is decreased, and the enquiring speaker has an opportunity to re-utter when his/her voice is not within the predetermined loudness range of the background speakers, thereby improving confidence in the voice recognition system.

As described above, according to the present invention, the loudness of a speaker is considered both at learning for registering his/her voice and at verifying a speaker, so that it is possible to verify the speaker more correctly.

Although the preferred embodiments of the present invention have been disclosed for illustrative purpose, those skilled in the art will appreciate that various modifications, additions and substitutions are possible, without departing from the scope and spirit of the invention as disclosed in the accompanying claims.

1. A voice registration method for voice recognition, comprising the steps of: analyzing a spectrum of a sound signal inputted from the outside; extracting predetermined language units for a speaker recognition from a voice signal in the sound signal; measuring the loudness of each language unit; collecting voice data on registered speakers including loudness data of the plurality of background speakers as a reference onto voice database; determining whether the loudness of each language unit is within a predetermined loudness range based on the voice data base; learning each language unit by using a multi-layer perceptron in the case that at least a predetermined number of language units are within the predetermined loudness range; and storing data on the learned language unit as data for recognizing the speaker.
 2. The method according to claim 1, wherein the voice analyzing step includes the steps of: representing the voice signal of the speaker as a spectrum; and compressing the spectrum by allocating filter banks to a speaker recognition region in which a voice characteristics of the speaker is to be recognized.
 3. The method according to claim 2, wherein the storing step is comprised of storing the plurality of speaker patterns of every language unit and the loudness of every language unit as a speaker recognition data.
 4. The method according to claim 2, wherein the speaker recognition region is 0˜3 KHz in which the filter banks are uniformly allocated, whereas over 3 KHz the intervals of the filter banks become logarithmically increased.
 5. The method according to claim 4, wherein the storing step is comprised of storing the plurality of speaker patterns of every language unit and the loudness of every language unit as a speaker recognition data.
 6. The method according to claim 4, further comprising the step of employing a plurality of phonemes selected from nasals, vowels, and approximants which include relatively lots of continuous sound as the language units, wherein the language unit extracting step includes the steps of making a plurality of frames by dividing the spectrum into several parts, and extracting a frame having the language unit among the frames.
 7. The method according to claim 6, wherein the storing step is comprised of storing the plurality of speaker patterns of every language unit and the loudness of every language unit as a speaker recognition data.
 8. The method according to claim 6, wherein the loudness measuring step is comprised of calculating an energy value of the frame having the language unit of the spectrum.
 9. The method according to claim 8, wherein the storing step is comprised of storing the plurality of speaker patterns of every language unit and the loudness of every language unit as a speaker recognition data.
 10. The method according to claim 8, further comprising the step of extracting maximum and minimum loudness by analyzing the voice spectrum of the background speakers stored in the voice database and by calculating the energy value of the frame having the language unit, wherein the loudness determining step is comprised of determining whether the number of the frames having the loudness within the maximum and minimum loudness occupies a predetermined rate or more.
 11. The method according to claim 10, wherein the storing step is comprised of storing the plurality of speaker patterns of every language unit and the loudness of every language unit as a speaker recognition data.
 12. The method according to claim 10, further comprising the steps of forming a plurality of reference patterns to every language unit of the plurality of background speakers, and forming a plurality of speaker patterns to every language unit of the plurality of speakers, wherein the learning step includes the step of learning a pattern characteristics of the speaker by comparing the reference patterns with the speaker patterns according to a back-propagation algorithm.
 13. The method according to claim 12, wherein the storing step is comprised of storing the plurality of speaker patterns of every language unit and the loudness of every language unit as a speaker recognition data.
 14. The method according to claim 12, further comprising the step of making learning groups as many as the number of language units of the background speakers by employing the plurality of reference patterns to every language unit of one background speaker as a learning group, wherein the learning step is comprised of learning the pattern characteristics of the speaker by comparing the reference patterns of every learning group with the plurality of the speaker patterns.
 15. The method according to claim 14, wherein the storing step is comprised of storing the plurality of speaker patterns of every language unit and the loudness of every language unit as a speaker recognition data.
 16. The method according to claim 1, wherein the storing step is comprised of storing the plurality of speaker patterns of every language unit and the loudness of every language unit as a speaker recognition data.
 17. The method according to claim 1, further comprising the step of requesting the speaker to re-utter in the case that at least the predetermined number of language units are not within the predetermined loudness range.
 18. A speaker recognition method for recognizing whether a speaker is a registered speaker, comprising the steps of: analyzing a spectrum of a sound signal inputted from the outside; extracting predetermined language units for a speaker recognition from a voice signal in the sound signal; measuring the loudness of each language unit; determining whether the loudness of each language unit is within a predetermined loudness range; calculating a speaker score by calculating the probability that the language unit will belong to the speaker through a multi-layer perceptron, and by averaging the probability, in the case that at least a predetermined number of language units are within the predetermined loudness range; and verifying that the speaker is registered when the speaker score is beyond a threshold value by comparing the calculated speaker score with the predetermined threshold value which is a predetermined minimum speaker score for verifying the registered speaker.
 19. The method according to claim 18, wherein the speaker score can be calculated from the following equation $\mathrm{Score}_{speaker} = \frac{1}{M}\sum\limits_{i = 0}^{M - 1} P(LU_{i})$ where P(LU_i) is a score of the probability that the enquiring speaker is the background speaker of an i-th language unit frame, and M is the number of language unit frame extracted from an isolated word.
 20. The method according to claim 19, wherein the speaker score can be calculated on the basis of weight of the language units given according to verifiability.
 21. A voice recognition system for voice recognition, comprising: a voice analyzer analyzing a spectrum of a sound signal inputted from the outside; a voice extractor extracting a voice signal from the sound signal and extracting predetermined language units for recognizing a speaker from the voice signal; a voice database storing therein background speaker voice data including the loudness of a plurality of reference background speakers; a loudness determiner determining the loudness of each language unit, and determining whether the loudness of each language unit is within a predetermined loudness range on the basis of the voice database; a learner learning the language unit in the case that at least a predetermined number of additional ones of the language units are within the predetermined loudness range; a memory storing data on the learned language units as recognition data for the speaker; and a controller controlling operations of the voice analyzer, the voice extractor, the loudness determiner and the learner when a voice is inputted, and storing the recognition data for the speaker in the memory.
 22. The system according to claim 21, wherein the voice analyzer represents the voice signal of the speaker as a spectrum, and compresses the spectrum by allocating filter banks to a speaker recognition region in which the speaker is to be recognized, at a predetermined interval rate.
 23. The system according to claim 22, wherein the speaker recognition region is 0˜3 KHz in which the filter banks are uniformly allocated, whereas over 3 KHz the intervals of the filter banks become logarithmically increased.
 24. The system according to claim 23, wherein the voice extractor makes a plurality of frames by dividing the spectrum into several parts, and extracting a frame having phonemes selected from nasals, vowels, and approximants, which include relatively lots of continuous sound as the language units the language unit, among the plurality of frames.
 25. The system according to claim 24, wherein the loudness determiner calculates an energy value of the frame having the language unit of the spectrum.
 26. The system according to claim 25, wherein the loudness determiner previously determines maximum and minimum loudness by analyzing the voice spectrum of the background speakers stored in the voice database and by calculating the energy value of the frame having the language unit, and determines whether the number of the frame having the loudness within the maximum and minimum loudness is beyond a predetermined rate.
 27. The system according to claim 26, wherein the voice extractor forms a plurality of reference patterns corresponding to every language unit of the plurality of background speakers, and forms a plurality of speaker patterns to every language unit of the plurality of speakers; makes a plurality of learning groups by employing the plurality of reference patterns to every language unit of one background speaker as one learning group.
 28. The system according to claim 27, wherein the learner learns a pattern property of the speaker by comparing the reference patterns with the speaker patterns according to a back-propagation algorithm.
 29. The system according to claim 28, wherein in the memory are stored the plurality of speaker patterns of every language unit and the loudness of every language unit as a speaker recognition data.
 30. The system according to claim 29, wherein the controller requests the speaker to re-utter in the case that at least the predetermined number more among all language units of the isolated word is within the predetermined loudness range.
 31. A speaker recognition system for recognizing whether a speaker is a registered speaker, comprising: a voice analyzer analyzing a spectrum of a voice signal inputted from external sound signals; a voice extractor picking out voice signals among inputted sound and abstracting predetermined language units for recognizing the speaker from the voice signals; a loudness determiner determining the loudness of each language unit, and determining whether the loudness of each language unit is within a predetermined loudness range; a speaker score calculator calculating a speaker score by calculating probability of that the language unit will belong to the speaker, and by averaging the probability; and a controller controlling the speaker score calculator to calculate the speaker score in the case that at least the predetermined number more among all language units is within the predetermined loudness range, and ascertaining that the speaker has been registered when the speaker score is beyond a threshold value by comparing the calculated speaker score with the predetermined threshold value which is a predetermined minimum speaker score for ascertaining the registered speaker.
 32. The system according to claim 31, wherein the speaker score can be derived from $\mathrm{Score}_{speaker} = \frac{1}{M}\sum\limits_{i = 0}^{M - 1} P(LU_{i})$ where P(LU_i) is a probability score of that the enquiring speaker is the background speaker of an i-th language unit frame, and M is the number of language unit frame abstracted from the isolated word.
 33. The system according to claim 32, wherein the speaker score calculator calculates the speaker score on the basis of the language units according to discrimination.